Abstract

We consider the problem of embedding enti- 
ties and relations of knowledge bases in low- 
dimensional vector spaces. Unlike most ex- 
isting approaches, which are primarily effi- 
cient for modeling equivalence relations, our 
approach is designed to explicitly model ir- 
reflexive relations, such as hierarchies, by in- 
terpreting them as translations operating on 
the low-dimensional embeddings of the en- 
tities. Preliminary experiments show that, 
despite its simplicity and a smaller num- 
ber of parameters than previous approaches, 
our approach achieves state-of-the-art performance
according to standard evaluation protocols
on data from WordNet and Freebase.



1. Introduction

Multi-relational data, which refers to directed graphs 
whose nodes correspond to entities and edges repre- 
sent relations that link these entities, plays a pivotal 
role in many areas such as recommender systems, the
Semantic Web, or computational biology. Relations 
are modeled as triplets of the form (head, label, tail), 
where label indicates the type of link between the en- 
tities head and tail. Relations are thus of several types 
and can exhibit various properties (symmetry, tran- 
sitivity, irreflexivity, etc.). Such graphs are popular 
tools for encoding data via knowledge bases (KBs), se- 
mantic networks or any kind of database following the 
Resource Description Framework format. Hence, they 
are widely used in the Semantic Web (e.g. Freebase or
Google Knowledge Graph) but also for knowledge
management in bioinformatics (e.g. GeneOntology)
or natural language processing (e.g. WordNet).

Despite their appealing ability for representing com- 
plex data, multi-relational databases remain compli- 
cated to manipulate because of the heterogeneity of 
the relations (frequencies, connectivity), their inherent 
noise (collaborative or semi-automatic creation) and 
their very large dimension (up to millions of entities 
and thousands of relation types). 

In this paper, we introduce a distributed model, which 
learns to embed such data in a vector space, where 
entities are modeled as low-dimensional embeddings. 
Many existing approaches (e.g. from Sutskever et al.
(2009); Nickel et al. (2011)) interpret relations as linear
transformations of these embeddings: when $(h, \ell, t)$
holds, then the embeddings of head $h$ and tail $t$ should
be close (in the embedding space) after transformation
by a linear operator that depends on the label $\ell$. With
such an interpretation, the model implies that the re- 
lation is reflexive since the embedding of h will always 
be its nearest neighbor, and because of the triangle 
inequality, the model will, to some extent, imply some 
form of transitivity of the relation. 

While this interpretation is fine for equivalence relations
(such as WordNet's _similar_to), it is inadequate
for irreflexive relations that represent hierarchies,
such as WordNet's _hypernym or Freebase's type
hierarchy. Indeed, taking the simplest example of en- 
tities organized in a tree with two relations, "sibling" 
and "parent", the embeddings of siblings should be 
close to each other (since it essentially is an equiva- 
lence relation), but enforcing the constraint that par- 
ent nodes should be close to their child nodes will lead 
the embedding of the whole tree to collapse to a small 
region of the space where the siblings and parent of a 
given node are impossible to distinguish. 

Since hierarchical and irreflexive relations are extremely
common in KBs, we propose a simple model to
efficiently represent them, by interpreting relations as
translations in the embedding space: if $(h, \ell, t)$ holds,
then the embedding of $t$ should be close to the embedding
of $h$ plus some vector that depends on $\ell$. This
approach is motivated by the natural representation 
of trees (i.e. embeddings of the nodes in dimension 2): 
while siblings are close to each other and nodes at a 
given height are organized on the x-axis, the parent-child
relation corresponds to a translation on the y-axis.
Another, secondary, motivation comes from the
recent work of Mikolov et al. (2013), in which the au- 
thors learn word embeddings from free text, and some 
one-to-one relations between entities of different types, 
such as the relation "capital of" between countries and 
cities, are (coincidentally rather than willingly) repre- 
sented by the model as translations in the embedding 
space. Our approach may then be used in the context 
of learning word embeddings in the future to reinforce 
this kind of structure in the embedding space.

Apart from the main line of algorithms to learn em- 
beddings of KBs, a number of recent approaches deal 
with the asymmetry of the relations at the expense 
of an explosion of model parameters. We present an 
empirical evaluation on data dumps of WordNet and 
Freebase, in which our model achieves strong results 
compared to such algorithms, with far fewer parameters
and even lower-dimensional embeddings.

In the remainder of the paper, we describe some of the 
related work in Section 2. We then describe our model 
in Section 3, and discuss its connections with related 
methods. We report preliminary experimental results 
on WordNet and Freebase in Section 4. We finally 
sketch some future work directions in Section 5. 

2. Related work 

Most previous methods designed to model relations
in multi-relational data rely on latent representations
or embeddings. The simplest form of latent attribute
that can be associated with an entity is a latent
class. Several clustering approaches have been
proposed. Kemp et al. (2006) considered a non-parametric
Bayesian extension of the stochastic blockmodel
that automatically infers the number of
latent clusters; Kok & Domingos (2007) introduced
clustering in Markov logic networks; Sutskever et al.
(2009) used a non-parametric Bayesian clustering of
entity embeddings in a collective matrix factorization
formulation. All these models cluster not only entities
but relation labels as well.



These methods can provide interpretations and anal- 
ysis of the data but are slow and do not scale to 
large databases, due to the high cost of inference. In 
terms of scalability, models based on tensor factorization
(like those from Harshman & Lundy (1994)
or Nickel et al. (2011)) have been shown to be efficient.
However, they have been outperformed by energy- 
based models (Bordes et al., 2011; Jenatton et al., 
2012; Bordes et al., 2013; Chen et al., 2013). These 
methods represent entities as low-dimensional embed- 
dings and relations as linear or bilinear operators on 
them and are trained via an online process, which al- 
lows them to scale well to large numbers of entities 
and relation types. In Section 4, we compare our 
new approach to SE (Bordes et al., 2011) and SME 
(Bordes et al., 2013). 

3. Translation-based model 

We now describe our model and discuss its relationship 
to existing approaches. 

3.1. Our model 

Given a training set $S$ of labeled arcs $(h, \ell, t)$, our goal
is to learn vector embeddings for all values of $h$, $\ell$ and
$t$. We assume all nodes and labels appear at least
once in the training set. The embeddings take values
in $\mathbb{R}^k$ ($k$ is a model hyperparameter) and are denoted
with the same letter, in boldface characters. The basic
idea behind our model is that the functional relation
induced by the $\ell$-labeled arcs corresponds to a translation
of the embeddings, i.e. we want
$\mathbf{h} + \boldsymbol{\ell} \approx \mathbf{t}$
when $(h, \ell, t)$ holds, while $\mathbf{h} + \boldsymbol{\ell}$ should be far away
from $\mathbf{t}$ otherwise.

To learn such embeddings, we minimize the following 
margin-based ranking criterion over the training set: 

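$$\sum_{(h,\ell,t) \in S} \; \sum_{(h',\ell,t') \in S'_{(h,\ell,t)}} \left[ \gamma + d(\mathbf{h} + \boldsymbol{\ell}, \mathbf{t}) - d(\mathbf{h'} + \boldsymbol{\ell}, \mathbf{t'}) \right]_+ \qquad (1)$$

where $[x]_+$ denotes the positive part of $x$, $\gamma > 0$ is
a margin hyperparameter, $d(x, y)$ is some dissimilarity
function on $\mathbb{R}^k$, e.g. the Euclidean distance or the
squared Euclidean distance, and

$$S'_{(h,\ell,t)} = \{(h', \ell, t) \mid h' \in N\} \cup \{(h, \ell, t') \mid t' \in N\}. \qquad (2)$$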

The set of "negative" examples we sample according to 
Equation 2 is basically the training ("positive") triple
with either the head or tail replaced by a random entity 
(but not both at the same time). The loss function (1) 
favors low values of dissimilarity between head+label 
and tail for positive triplets, and large values for neg- 
ative triplets, and is thus a natural implementation of 
the intended criterion. 
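
As an illustration of criterion (1) and of the negative set of Equation (2), the following is a minimal pure-Python sketch, not the authors' implementation. The function names (`corrupt`, `hinge_loss`, `translate`), the toy embeddings and the choice of the squared Euclidean distance are assumptions made for the example only.

```python
import random

def squared_euclidean(x, y):
    """Dissimilarity d(x, y): squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def translate(h, l):
    """Return h + l, the head embedding translated by the label embedding."""
    return [a + b for a, b in zip(h, l)]

def corrupt(triple, entities, rng):
    """Draw one element of the negative set of Equation (2): replace either
    the head or the tail (never both) by an entity drawn at random."""
    h, l, t = triple
    if rng.random() < 0.5:
        return (rng.choice(entities), l, t)
    return (h, l, rng.choice(entities))

def hinge_loss(pos, neg, ent_emb, lab_emb, gamma=1.0, d=squared_euclidean):
    """One term of criterion (1): [gamma + d(h + l, t) - d(h' + l, t')]_+."""
    h, l, t = pos
    h2, _, t2 = neg
    d_pos = d(translate(ent_emb[h], lab_emb[l]), ent_emb[t])
    d_neg = d(translate(ent_emb[h2], lab_emb[l]), ent_emb[t2])
    return max(0.0, gamma + d_pos - d_neg)

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
ent_emb = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.0]}
lab_emb = {"r": [0.0, 1.0]}

# ("a", "r", "b") is a perfect translation, so its dissimilarity is 0;
# the corrupted triple ("a", "r", "c") has dissimilarity 2.
loss_small_margin = hinge_loss(("a", "r", "b"), ("a", "r", "c"), ent_emb, lab_emb, gamma=1.0)
loss_large_margin = hinge_loss(("a", "r", "b"), ("a", "r", "c"), ent_emb, lab_emb, gamma=3.0)
```

With a margin of 1 the negative triple is already far enough and the term vanishes; with a margin of 3 the term is positive and would contribute a gradient. Since Equation (2) ranges over all entities, `corrupt` may occasionally return the original triple; the sketch leaves that case untouched.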

The minimization is carried out by stochastic gradient
descent, over the possible $\mathbf{h}$, $\boldsymbol{\ell}$ and $\mathbf{t}$, with the
additional constraint that the L2-norm of the embeddings
of the entities is 1 (no regularization or norm
constraints are applied to the label embeddings $\boldsymbol{\ell}$).
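
A single stochastic gradient step could look like the sketch below. It assumes the squared Euclidean distance as dissimilarity and enforces the unit-norm constraint by rescaling the updated entity embeddings after each step; the text does not say how the constraint is enforced, so this projection, like the names `sgd_step` and `normalize`, is an assumption of the example.

```python
import math

def normalize(v):
    """Rescale v to unit L2 norm (the constraint placed on entity embeddings)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else v

def sgd_step(pos, neg, ent, lab, gamma=1.0, lr=0.01):
    """One gradient step on [gamma + d(h + l, t) - d(h' + l, t')]_+ with d the
    squared Euclidean distance. Returns the loss value before the update."""
    (h, l, t), (h2, _, t2) = pos, neg
    k = len(lab[l])
    r_pos = [ent[h][i] + lab[l][i] - ent[t][i] for i in range(k)]    # h + l - t
    r_neg = [ent[h2][i] + lab[l][i] - ent[t2][i] for i in range(k)]  # h' + l - t'
    loss = gamma + sum(a * a for a in r_pos) - sum(a * a for a in r_neg)
    if loss <= 0:
        return 0.0  # margin already satisfied: the gradient is zero

    # Accumulate entity gradients first, because the positive and the
    # negative triple share all but one entity.
    grads = {}
    for e, sign, r in ((h, 1, r_pos), (t, -1, r_pos), (h2, -1, r_neg), (t2, 1, r_neg)):
        g = grads.setdefault(e, [0.0] * k)
        for i in range(k):
            g[i] += sign * 2 * r[i]

    # Label embedding: plain gradient step, no norm constraint.
    for i in range(k):
        lab[l][i] -= lr * 2 * (r_pos[i] - r_neg[i])

    # Entity embeddings: gradient step followed by projection on the unit sphere.
    for e, g in grads.items():
        ent[e] = normalize([ent[e][i] - lr * g[i] for i in range(k)])
    return loss
```

Repeating such steps over sampled positive and corrupted triples gives one epoch of training; only the entities are renormalized, in line with the constraint stated above.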

3.2. Relationship to previous approaches 

Section 2 described a large body of work on embedding 
KBs. We detail here the relationships between our 
model and those of Bordes et al. (2011) (Structured
Embeddings or SE) and Chen et al. (2013). 

SE (Bordes et al., 2011) embeds nodes into $\mathbb{R}^k$, and labels
into two matrices $L_1 \in \mathbb{R}^{k \times k}$ and $L_2 \in \mathbb{R}^{k \times k}$ such
that $d(L_1 \mathbf{h}, L_2 \mathbf{t})$ is small for positive triplets $(h, \ell, t)$
(and large otherwise). The basic idea is that when
two nodes belong to the same edge, their embeddings
should be close to each other in some subspace that
depends on the label. This basic idea would imply
$L_1 = L_2$, and using two different projection matrices
for the head and for the tail is intended to account for
the possible asymmetry of relation $\ell$. When the dissimilarity
function takes the form $d(x, y) = g(x - y)$
for some $g : \mathbb{R}^k \to \mathbb{R}$ (e.g. $g$ is a norm), then the model
of SE with an embedding of size $k + 1$ is strictly more
expressive than our model with an embedding of size
$k$, since linear operators in dimension $k + 1$ can reproduce
affine transformations in a subspace of dimension
$k$ (by constraining the $(k+1)$-th dimension of all node
embeddings to be equal to 1). SE, with $L_2$ as the identity
matrix and $L_1$ taken so as to reproduce a translation,
is then equivalent to our model. Despite the lower
expressiveness of our model, we still reach better
performance than SE in our experiments because
(1) our model is a more direct way to represent the
true properties of the relations, and (2) regularization,
and more generally any form of capacity control, is
difficult in embedding models; greater expressiveness
may then be more synonymous with overfitting than with
better performance.
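
The argument that a linear operator in dimension k + 1 can reproduce a translation in dimension k can be checked on a small example. The sketch below is only an illustration: the helper names (`matvec`, `translation_matrix`, `l1_distance`) and the numerical values are hypothetical, and the L1 distance stands in for any dissimilarity of the form g(x - y).

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def translation_matrix(l):
    """(k+1) x (k+1) matrix that maps (x, 1) to (x + l, 1): the identity
    with the translation vector l written in the last column."""
    k = len(l)
    M = [[1.0 if i == j else 0.0 for j in range(k + 1)] for i in range(k + 1)]
    for i in range(k):
        M[i][k] = l[i]
    return M

def l1_distance(x, y):
    """Dissimilarity of the form g(x - y), here with g the L1 norm."""
    return sum(abs(a - b) for a, b in zip(x, y))

h, l, t = [0.5, -1.0], [0.25, 2.0], [1.0, 0.5]

# Node embeddings of size k + 1 whose last coordinate is fixed to 1.
h_aug, t_aug = h + [1.0], t + [1.0]

# SE-style energy d(L1 h, L2 t) with L1 a translation matrix and L2 the identity.
se_energy = l1_distance(matvec(translation_matrix(l), h_aug), t_aug)

# Translation energy d(h + l, t) in dimension k.
translation_energy = l1_distance([a + b for a, b in zip(h, l)], t)
```

Both energies coincide on this example, which is the equivalence stated in the text for the special choice of L1 and L2.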

Table 1. Statistics of the data sets used in this paper.

Data set       WordNet     Freebase
Entities        40,943       14,951
Rel. types          18        1,345
Train. ex.     141,442      483,142
Valid ex.        5,000       50,000
Test ex.         5,000       59,071

Another related model is the Neural Tensor Model
of Chen et al. (2013). A special case of that model
(which would actually boil down to a "Neural Matrix
Model") corresponds to learning scores $s(h, \ell, t)$ (higher
scores for positive triplets) of the form:

$$s(h, \ell, t) = \mathbf{h}^\top L \mathbf{t} + \boldsymbol{\ell}_1^\top \mathbf{h} + \boldsymbol{\ell}_2^\top \mathbf{t} \qquad (3)$$

where $L \in \mathbb{R}^{k \times k}$, $\boldsymbol{\ell}_1 \in \mathbb{R}^k$ and $\boldsymbol{\ell}_2 \in \mathbb{R}^k$, all of them
depending on $\ell$.

If we consider our model with the squared distance as
dissimilarity function, we have:

$$d(\mathbf{h} + \boldsymbol{\ell}, \mathbf{t}) = \|\mathbf{h}\|_2^2 + \|\boldsymbol{\ell}\|_2^2 + \|\mathbf{t}\|_2^2 - 2\left(\mathbf{h}^\top \mathbf{t} + \boldsymbol{\ell}^\top (\mathbf{t} - \mathbf{h})\right).$$

Considering our norm constraints ($\|\mathbf{h}\|_2^2 = \|\mathbf{t}\|_2^2 = 1$)
and the ranking criterion (1), in which $\|\boldsymbol{\ell}\|_2^2$ does
not play any role in comparing positive and negative
triplets, our model thus corresponds to scoring
triplets according to $\mathbf{h}^\top \mathbf{t} + \boldsymbol{\ell}^\top (\mathbf{t} - \mathbf{h})$, and thus
corresponds to Chen et al. (2013)'s model (Equation (3))
where $L$ is the identity matrix and, matching the linear
terms in $\mathbf{h}$ and $\mathbf{t}$, $\boldsymbol{\ell} = \boldsymbol{\ell}_2 = -\boldsymbol{\ell}_1$.
We could not run experiments with that model, but
once again our model has far fewer parameters: this
should ease the training and prevent overfitting, and
hence compensate for a lower expressiveness.
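
The expansion of the squared distance, and the fact that it ranks unit-norm candidates exactly as the bilinear score does, can be verified numerically. The sketch below is only a check on hypothetical vectors; `sq_dist_energy` and `bilinear_score` are names introduced for the example, and the score is written as the special case of Equation (3) with L the identity, a linear term of -l on the head and +l on the tail.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def unit(v):
    """Rescale v to unit L2 norm, as required for entity embeddings."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def sq_dist_energy(h, l, t):
    """d(h + l, t) with the squared Euclidean distance."""
    return sum((a + b - c) ** 2 for a, b, c in zip(h, l, t))

def bilinear_score(h, l, t):
    """h^T t + l^T (t - h), i.e. Equation (3) with L = I and linear
    terms -l on the head and +l on the tail."""
    return dot(h, t) + dot(l, [c - a for a, c in zip(h, t)])

h = unit([3.0, 4.0])
l = [0.5, -0.2]
candidates = [unit(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0])]

# Identity: d(h + l, t) = |h|^2 + |l|^2 + |t|^2 - 2 * score, with |h| = |t| = 1.
gaps = [
    abs(sq_dist_energy(h, l, t) - (2.0 + dot(l, l) - 2.0 * bilinear_score(h, l, t)))
    for t in candidates
]

# Ranking candidates by ascending energy or by descending score gives the same order.
by_energy = sorted(range(len(candidates)), key=lambda i: sq_dist_energy(h, l, candidates[i]))
by_score = sorted(range(len(candidates)), key=lambda i: -bilinear_score(h, l, candidates[i]))
```

Because the term in the norm of the label embedding is the same for a triplet and its corrupted versions, it shifts all energies by a constant and leaves the ranking unchanged.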

4. Experiments 

Our approach is evaluated against the methods SE and 
SME (Semantic Matching Energy) from (Bordes et al., 
2011; 2013) on two data sets and using the same rank- 
ing setting for evaluation. 

We measure the mean and median predicted ranks and 
the top-10, computed with the following procedure. 
For each test triplet, the head is removed and replaced 
by each of the entities of the dictionary in turn. En- 
ergies (i.e. dissimilarities) of those corrupted triplets 
are computed by the model and sorted in ascending
order, and the rank of the correct entity is stored. This
whole procedure is also repeated when removing the
tail instead of the head. We report the mean and median
of those predicted ranks and the top-10, which is
the proportion of correct entities in the top 10 ranks.
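
The ranking procedure described above can be sketched as follows. This is an illustrative reading of the protocol, not the evaluation code used for the reported results: the function names, the toy one-dimensional model and the handling of ties (the correct entity is ranked ahead of candidates with equal energy) are assumptions.

```python
from statistics import mean, median

def rank_of(correct, candidates, energy):
    """1-based rank of `correct` when candidates are sorted by ascending
    energy. Ties are resolved in favour of the correct entity (assumption)."""
    target = energy(correct)
    return 1 + sum(1 for c in candidates if c != correct and energy(c) < target)

def evaluate(test_triples, entities, energy):
    """Return (mean rank, median rank, top-10 proportion), replacing in turn
    the head and the tail of every test triplet by each entity."""
    ranks = []
    for h, l, t in test_triples:
        ranks.append(rank_of(h, entities, lambda e: energy(e, l, t)))  # head removed
        ranks.append(rank_of(t, entities, lambda e: energy(h, l, e)))  # tail removed
    top10 = sum(1 for r in ranks if r <= 10) / len(ranks)
    return mean(ranks), median(ranks), top10

# Toy model: entities are the integers 0..19 embedded on a line, and the
# hypothetical label "next" is a translation by +1.
offsets = {"next": 1}

def toy_energy(h, l, t):
    return abs(h + offsets[l] - t)

entities = list(range(20))
mean_rank, median_rank, top10 = evaluate([(3, "next", 4), (3, "next", 6)], entities, toy_energy)
```

In this toy run the first triplet is a perfect translation and gets rank 1 on both sides, while the second is off by two positions and gets rank 4 on both sides.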

4.1. Data 

We used data from two KBs; their statistics are given 
in Table 1. 

WordNet. This knowledge base is designed to produce
an intuitively usable dictionary and thesaurus,
and support automatic text analysis. Its entities
(termed synsets) correspond to word senses,
and relation types define lexical relations between
them. We considered the data version used
in (Bordes et al., 2013). Examples of triplets
are (_score_NN_1, _hypernym, _evaluation_NN_1) or
(_score_NN_2, _has_part, _musical_notation_NN_1).⁴

Table 2. Some example predictions on the Freebase test set using our approach. Bold indicates the test triple's true tail
and italics other true tails present in the training set. Actual Freebase identifiers have been replaced by readable strings.

Input (Head and Label)                  Predicted Tails

J. K. Rowling influenced by             G. K. Chesterton, J. R. R. Tolkien, C. S. Lewis, Lloyd Alexander,
                                        Terry Pratchett, Roald Dahl, Jorge Luis Borges, Stephen King, Ian Fleming

Anthony LaPaglia performed in           Lantana, Summer of Sam, Happy Feet, The House of Mirth,
                                        Unfaithful, Legend of the Guardians, Naked Lunch, X-Men, The Namesake

Camden County adjoins                   Burlington County, Atlantic County, Gloucester County, Union County,
                                        Essex County, New Jersey, Passaic County, Ocean County, Bucks County

The 40-Year-Old Virgin nominated for    MTV Movie Award for Best Comedic Performance,
                                        BFCA Critics' Choice Award for Best Comedy,
                                        MTV Movie Award for Best On-Screen Duo,
                                        MTV Movie Award for Best Breakthrough Performance,
                                        MTV Movie Award for Best Movie, MTV Movie Award for Best Kiss,
                                        D. F. Zanuck Producer of the Year Award in Theatrical Motion Pictures,
                                        Screen Actors Guild Award for Best Actor - Motion Picture

David Foster has the genre              Pop music, Pop rock, Adult contemporary music, Dance music,
                                        Contemporary R&B, Soft rock, Rhythm and blues, Easy listening

Costa Rica football team has position   Forward, Defender, Midfielder, Goalkeepers,
                                        Pitchers, Infielder, Outfielder, Center, Defenseman

Lil Wayne born in                       New Orleans, Atlanta, Austin, St. Louis,
                                        Toronto, New York City, Wellington, Dallas, Puerto Rico

WALL-E has the genre                    Animations, Computer Animation, Comedy film,
                                        Adventure film, Science Fiction, Fantasy, Stop motion, Satire, Drama

Richard Crenna has cause of death       Pancreatic cancer, Cardiovascular disease, Meningitis, Cancer,
                                        Prostate cancers, Stroke, Liver tumour, Brain tumor, Multiple myeloma



Freebase. Freebase is a huge and growing database
of general facts; there are currently around 1.2 billion
triplets. To make a small data set to experiment
on, we selected the subset of entities that are
also present in the Wikilinks database⁵ and that also
have at least 100 mentions in Freebase (for both entities
and relations). We also removed negative relations
like '!/people/person/nationality', which just
reverses the head and tail compared to the relation
'/people/person/nationality'. This resulted in 592,213
triplets with 14,951 entities and 1,345 relations, which
were randomized and split as shown in Table 1.

4.2. Implementation

We implemented our model using the SME library⁶,
which already provides code for SE and SME. The
dissimilarity measure $d$ was set to the $L_1$ distance, mostly
because it led to faster training.

For this preliminary set of experiments, we did not
perform an extensive search for hyperparameters. For
experiments of our method on WordNet, we fixed the
learning rate for the stochastic gradient descent to
0.01 and the dimension $k$ of the embeddings to 20, and
chose the margin $\gamma$ among {1, 2, 10} with the validation
set (optimal value was 2). We report results for SE and
SME extracted from (Bordes et al., 2013), where those
models have been trained using a much more thorough
hyperparameter search. For experiments on Freebase,
we ran all experiments using the SME library with
fixed values for the learning rate (= 0.01), $k$ (= 50)
and $\gamma$ (= 1). For both datasets, the training time was
limited to at most 1,000 epochs over the training set.
The best model was selected using the mean predicted
rank on the validation set.

⁴ WordNet is composed of senses; its entities are named
by concatenating a word, its part-of-speech tag and
a digit indicating which sense it refers to, e.g. _score_NN_1
encodes the first meaning of the noun "score".

⁵ code.google.com/p/wiki-links

⁶ https://github.com/glorotxa/SME

Table 3. Link prediction results on WordNet.

Method            Mean rank   Med. rank   Top-10
Unstructured            317          26    35.1%
SE                    1,011           3    68.5%
SME (linear)            559           5    65.1%
SME (bilinear)          526           8    54.7%
Our Approach            263           4    75.4%

4.3. Results 

Tables 3 and 4 display the results on both data sets
for our method, compared to SE, to two versions of
SME and to Unstructured, a simple model which only
uses the dot-product between $\mathbf{h}$ and $\mathbf{t}$ as dissimilarity
measure for a triplet $(h, \ell, t)$, with no influence of
$\boldsymbol{\ell}$. Table 2 gives examples of link prediction
results of our approach on the Freebase test set.

Table 4. Link prediction results on Freebase.

Method            Mean rank   Med. rank   Top-10
Unstructured           1097         404     4.5%
SE                      272          38    28.8%
SME (linear)            274          34    30.7%
SME (bilinear)          284          35    31.3%
Our Approach            243          25    34.9%

Our method outperforms all counterparts on the mean
rank and top-10 metrics on both data sets, and on the
median rank on Freebase; on WordNet, SE reaches a
slightly better median rank (3 against 4). Results are
particularly good for the top-10 metric. We believe
that this performance is due to an appropriate design
of the model according to the data, but also to its
relative simplicity. Hence, even if the problem is
non-convex, it can
be optimized efficiently with stochastic gradient. We
showed in Section 3.2 that SE is more expressive than
our proposal. However, its complexity makes it quite
hard to train, as shown in the results of Tables 3 and 4.
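
To give an order of magnitude for this difference in complexity, the parameter counts of the two models can be compared with the entity and relation counts of Table 1 and the embedding sizes of Section 4.2. This is a back-of-the-envelope sketch: the function names are hypothetical, and using the same k for SE as for our model is an assumption (for WordNet in particular, the SE results come from Bordes et al. (2013) and its embedding size is not stated here).

```python
def translation_params(n_entities, n_relations, k):
    """One k-dimensional vector per entity and one per relation label."""
    return (n_entities + n_relations) * k

def se_params(n_entities, n_relations, k):
    """One k-dimensional vector per entity and two k x k matrices per label."""
    return n_entities * k + 2 * n_relations * k * k

# (entities, relation types, k): counts from Table 1, k from Section 4.2.
settings = {
    "WordNet": (40943, 18, 20),
    "Freebase": (14951, 1345, 50),
}

counts = {
    name: (translation_params(n_e, n_r, k), se_params(n_e, n_r, k))
    for name, (n_e, n_r, k) in settings.items()
}

for name, (ours, se) in counts.items():
    print(f"{name}: translation model {ours:,} parameters, SE {se:,} parameters")
```

Under these assumptions the gap is small on WordNet, which has only 18 relation types, and close to an order of magnitude on Freebase, where the two k x k matrices per relation dominate the count for SE.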

Table 2 illustrates the capabilities of our model. Given 
a head and a label, the top predicted tails (and the true 
one) are depicted. The examples come from the Freebase
test set. Even if the correct answer is not always
top-ranked, the predictions reflect common sense.

5. Conclusion and future work 

We proposed a new approach to learn embeddings 
of KBs, focusing on the minimal parametrization of 
the model to accurately represent hierarchical and ir- 
reflexive relations. This short paper is essentially in- 
tended to be a proof-of-concept that translations are 
adequate to model such relations in a multi-relational 
setting. It can be improved and better validated 
in several ways. For the experimental evaluation, 
this paper is the first one to present link prediction
on this dump of Freebase. More benchmarking
is needed, such as a comparison with the models of
Chen et al. (2013) and Jenatton et al. (2012). We also
intend to consider learning translations of word
embeddings, either from free text as in (Mikolov et al.,
2013) or from (subject, verb, object) triplets as in
(Bordes et al., 2011).

Finally, regarding modeling relations, equivalence relations
in our approach are represented by a translation
vector, which forces all members of an equivalence
class to be close to each other in the embedding
space (whatever the relation). Some additional degrees
of freedom may be given by adding a projection
matrix to each relation, so that equivalence relations
only force entities to be close to each other in some
subspace of the embedding space. However, this would
increase the number of parameters, and we believe that
regularization and optimization techniques should be
further studied to achieve optimal performance.